Building Customer Lifetime Models

In this workbook we investigate different ways to model the lifetime of individual customers, with a view to combining this approach with our frequency model into a more traditional Pareto/NBD (P/NBD) model.

1 Load and Configure Datasets

We first load some synthetic transaction data with a fixed transaction frequency and a varying lifetime.

customer_cohort_tbl <- read_rds("data/synthdata_lifetime_cohort_tbl.rds")
customer_cohort_tbl %>% glimpse()
## Rows: 50,000
## Columns: 4
## $ customer_id    <chr> "C201101_0001", "C201101_0002", "C201101_0003", "C20110…
## $ cohort_qtr     <chr> "2011 Q1", "2011 Q1", "2011 Q1", "2011 Q1", "2011 Q1", …
## $ cohort_ym      <chr> "2011 01", "2011 01", "2011 01", "2011 01", "2011 01", …
## $ first_tnx_date <date> 2011-01-01, 2011-01-01, 2011-01-01, 2011-01-01, 2011-0…
customer_simparams_tbl <- read_rds("data/synthdata_lifetime_simparams_tbl.rds")
customer_simparams_tbl %>% glimpse()
## Rows: 50,000
## Columns: 9
## $ customer_id     <chr> "C201101_0001", "C201101_0002", "C201101_0003", "C2011…
## $ cohort_qtr      <chr> "2011 Q1", "2011 Q1", "2011 Q1", "2011 Q1", "2011 Q1",…
## $ cohort_ym       <chr> "2011 01", "2011 01", "2011 01", "2011 01", "2011 01",…
## $ first_tnx_date  <date> 2011-01-01, 2011-01-01, 2011-01-01, 2011-01-01, 2011-…
## $ customer_mu     <dbl> 0.087383599, 0.009023485, 0.113799661, 0.016847819, 0.…
## $ customer_tau    <dbl> 13.4008628, 17.2527509, 0.9406138, 155.2154564, 61.892…
## $ customer_lambda <dbl> 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,…
## $ customer_nu     <dbl> 0.94528278, 1.76830099, 1.39246239, 0.36050074, 0.1393…
## $ customer_p      <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,…
customer_transactions_tbl <- read_rds("data/synthdata_lifetime_transactions_tbl.rds")
customer_transactions_tbl %>% glimpse()
## Rows: 135,337
## Columns: 5
## $ customer_id   <chr> "C201101_0001", "C201101_0002", "C201101_0002", "C201101…
## $ cohort_qtr    <chr> "2011 Q1", "2011 Q1", "2011 Q1", "2011 Q1", "2011 Q1", "…
## $ cohort_ym     <chr> "2011 01", "2011 01", "2011 01", "2011 01", "2011 01", "…
## $ tnx_timestamp <dttm> 2011-01-01 14:14:36, 2011-01-01 06:38:12, 2011-01-23 19…
## $ tnx_amount    <dbl> 100.25, 52.76, 52.22, 80.54, 315.04, 289.44, 290.18, 309…

We also want to set up a number of parameters for use in this workbook.

stan_modeldir <- "stan_models"
stan_codedir  <- "stan_code"

1.1 Calculate Summary Statistics

Next, we calculate the summary statistics for each of these customers, based on an observation date of 1 January 2020.

customer_summarystats_tbl <- customer_transactions_tbl %>%
  calculate_transaction_cbs_data(
    last_date = as.Date("2020-01-01")
    )

customer_summarystats_tbl %>% glimpse()
## Rows: 50,000
## Columns: 6
## $ customer_id    <chr> "C201101_0001", "C201101_0002", "C201101_0003", "C20110…
## $ first_tnx_date <dttm> 2011-01-01 14:14:36, 2011-01-01 06:38:12, 2011-01-01 1…
## $ last_tnx_date  <dttm> 2011-01-01 14:14:36, 2011-01-23 19:21:11, 2011-01-01 1…
## $ x              <dbl> 0, 1, 0, 19, 4, 2, 1, 0, 2, 2, 0, 0, 0, 1, 0, 0, 3, 0, …
## $ t_x            <dbl> 0.000000, 3.218550, 0.000000, 151.009731, 23.709454, 23…
## $ T_cal          <dbl> 469.4866, 469.5319, 469.5039, 469.4506, 469.5086, 469.5…
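The `calculate_transaction_cbs_data()` helper is defined elsewhere in this project; a minimal sketch of the per-customer summary it presumably computes (the column names come from the output above, but the function body is our assumption):

```r
library(dplyr)

# Hypothetical reconstruction of calculate_transaction_cbs_data():
# per-customer 'CBS' summaries relative to the observation date last_date.
calculate_transaction_cbs_data_sketch <- function(tnx_tbl, last_date) {
  tnx_tbl %>%
    group_by(customer_id) %>%
    summarise(
      first_tnx_date = min(tnx_timestamp),
      last_tnx_date  = max(tnx_timestamp),
      x              = n() - 1,     # count of repeat transactions
      t_x            = difftime(last_tnx_date, first_tnx_date, units = "weeks") %>%
        as.numeric(),               # recency: time from first to last transaction
      T_cal          = difftime(last_date, first_tnx_date, units = "weeks") %>%
        as.numeric(),               # total observation time since 'birth'
      .groups        = "drop"
      )
}
```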

1.2 Visualise the Transaction Data

As always, we want to visualise this data as much as we can to get a feel for it, so we construct a plot of each customer.

tnx_plot_tbl <- customer_transactions_tbl %>%
  group_nest(customer_id, .key = "tnx_data") %>%
  filter(map_int(tnx_data, nrow) > 1) %>%
  slice_sample(n = 50) %>%
  unnest(tnx_data)

ggplot(tnx_plot_tbl, aes(x = tnx_timestamp, y = customer_id)) +
  geom_line() +
  geom_point() +
  labs(
    x = "Date",
    y = "Customer ID",
    title = "Visualisation of Transactions for Sample of Customers"
    ) +
  theme(
    axis.text.y = element_text(size = 8)
    )

As part of the lifetime analysis, we need to look at the times between transactions; we will use these later in our analysis.

For the moment, we just want to see how the time differences are distributed.

customer_timediffs_tbl <- customer_transactions_tbl %>%
  group_by(customer_id) %>%
  mutate(
    tnx_timediff = difftime(
        tnx_timestamp, lag(tnx_timestamp),
        units = "weeks") %>%
      as.numeric()
    ) %>%
  ungroup()

ggplot(customer_timediffs_tbl %>% drop_na(tnx_timediff)) +
  geom_histogram(aes(x = tnx_timediff), bins = 50) +
  scale_y_continuous(labels = label_comma()) +
  labs(
    x = "Weeks",
    y = "Count",
    title = "Histogram of Weeks Between Successive Transactions"
    )

We can see that almost all time differences are less than 75 weeks, but it is worth looking at a summary of the raw data.

customer_timediffs_tbl %>% pull(tnx_timediff) %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.86    4.53    6.84    9.32   85.67   50000

Given that the largest observed gap is under 86 weeks, we will assume that a customer has ‘died’ 100 weeks after their most recent transaction. We use this cutoff in our model below.

1.3 Construct the Model Data

We take an analogous approach for our lifetime model to the one we used for the frequency model: we take a subset of the data for training, and use the remainder to validate the model.

We also need to exclude all customers with just a single transaction on our system, as the first transaction is considered the ‘birth’ of the customer and it is the second and subsequent transactions that allow us to calculate the summary statistics.

break_date <- as.Date("2018-01-01")

# The cohort data begins on 2011-01-01, so this gives the full observation window
full_weeks <- difftime(break_date, as.Date("2011-01-01"), units = "weeks") %>%
  as.numeric()

training_tbl <- customer_transactions_tbl %>%
  filter(
    tnx_timestamp <= break_date
    ) %>%
  group_nest(customer_id, .key = "tnx_data") %>%
  filter(map_int(tnx_data, nrow) > 1) %>%
  slice_sample(n = 10000) %>%
  unnest(tnx_data)

training_tbl %>% glimpse()
## Rows: 40,058
## Columns: 5
## $ customer_id   <chr> "C201610_0076", "C201610_0076", "C201610_0076", "C201610…
## $ cohort_qtr    <chr> "2016 Q4", "2016 Q4", "2016 Q4", "2016 Q4", "2015 Q1", "…
## $ cohort_ym     <chr> "2016 10", "2016 10", "2016 10", "2016 10", "2015 02", "…
## $ tnx_timestamp <dttm> 2016-10-05 04:02:25, 2016-10-06 13:50:03, 2016-10-08 10…
## $ tnx_amount    <dbl> 125.56, 109.38, 116.17, 84.56, 140.57, 181.75, 168.09, 1…
test_tbl <- customer_transactions_tbl %>%
  semi_join(training_tbl, by = "customer_id") %>%
  filter(
    tnx_timestamp > break_date
    )

test_tbl %>% glimpse()
## Rows: 2,694
## Columns: 5
## $ customer_id   <chr> "C201103_0202", "C201103_0202", "C201103_0202", "C201103…
## $ cohort_qtr    <chr> "2011 Q1", "2011 Q1", "2011 Q1", "2011 Q1", "2011 Q1", "…
## $ cohort_ym     <chr> "2011 03", "2011 03", "2011 03", "2011 03", "2011 03", "…
## $ tnx_timestamp <dttm> 2018-03-08 23:43:06, 2018-04-27 18:58:50, 2018-09-12 16…
## $ tnx_amount    <dbl> 57.28, 65.02, 71.91, 61.95, 60.90, 63.58, 67.77, 66.26, …

We need to reconstruct the summary statistics for the training data.

training_stats_tbl <- training_tbl %>%
  calculate_transaction_cbs_data(
    last_date = break_date
    ) %>%
  mutate(
    min_lifetime = t_x,
    max_lifetime = pmin(t_x + 100, T_cal)
    )

training_stats_tbl %>% glimpse()
## Rows: 10,000
## Columns: 8
## $ customer_id    <chr> "C201101_0002", "C201101_0004", "C201101_0006", "C20110…
## $ first_tnx_date <dttm> 2011-01-01 06:38:12, 2011-01-01 20:17:36, 2011-01-01 0…
## $ last_tnx_date  <dttm> 2011-01-23 19:21:11, 2013-11-23 21:55:42, 2011-06-11 0…
## $ x              <dbl> 1, 19, 2, 2, 3, 3, 2, 2, 1, 2, 1, 4, 4, 1, 4, 1, 1, 3, …
## $ t_x            <dbl> 3.2185499, 151.0097313, 23.0424788, 3.3377017, 21.64700…
## $ T_cal          <dbl> 365.2462, 365.1649, 365.2819, 365.2011, 365.2057, 365.1…
## $ min_lifetime   <dbl> 3.2185499, 151.0097313, 23.0424788, 3.3377017, 21.64700…
## $ max_lifetime   <dbl> 103.2185, 251.0097, 123.0425, 103.3377, 121.6470, 104.4…
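The max_lifetime bound applies the 100-week ‘death’ cutoff from earlier, truncated at the observation window; for the first customer above:

```r
# min_lifetime is the observed recency t_x; max_lifetime adds the 100-week
# cutoff, capped at the total observation time T_cal.
pmin(3.2185499 + 100, 365.2462)   # 103.2185, matching max_lifetime above
```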

For model validation purposes, we just need to check for the presence or absence of customer transactions in the test set.

2 Construct Initial Lifetime Models

We now turn our attention to fitting our model to get estimates of the lifetime of the customer.

There is one important distinction between our lifetime model and the frequency model: we do not have direct observations of the lifetime of the customer.

Instead, what we have is an exact observation of each customer’s minimum possible lifetime: the time between the very first transaction that customer made and their most recently observed transaction.

We know that the customer’s lifetime is at least this number.
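This is a censored-data likelihood: we observe only bounds on the lifetime \(\tau\), not \(\tau\) itself. The Stan models below express these bounds via `exponential_lccdf` and `exponential_lcdf`; the same contributions can be sketched in R (the function name here is ours):

```r
# Log-likelihood contribution of a censored exponential lifetime:
# all we know is min_lifetime <= tau (and, optionally, tau <= max_lifetime).
censored_loglik <- function(mu, min_lifetime, max_lifetime = Inf) {
  # log P(tau > min_lifetime): the survival function, Stan's exponential_lccdf
  ll <- pexp(min_lifetime, rate = mu, lower.tail = FALSE, log.p = TRUE)
  if (is.finite(max_lifetime)) {
    # plus log P(tau < max_lifetime): the CDF, Stan's exponential_lcdf
    ll <- ll + pexp(max_lifetime, rate = mu, log.p = TRUE)
  }
  ll
}

censored_loglik(0.1, 10)   # log S(10) = -mu * t = -1
```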

2.1 Fit Flat Model

We start with a simple flat model without any explicit priors, just as we did for the frequency models.

## data {
##   int<lower=1> n;                  // count of observations
## 
##   vector<lower=0>[n] min_lifetime; // minimum observed lifetime for customer
##   vector<lower=0>[n] max_lifetime; // maximum observed lifetime for customer
##   vector<lower=0>[n] obs_time;     // observation time since customer 'birth'
## }
## 
## parameters {
##   vector<lower=0>[n] mu;
## }
## 
## model {
##   target += exponential_lccdf(min_lifetime | mu);
##   target += exponential_lcdf (max_lifetime | mu);
## }
## 
## generated quantities {
## #include lifetime_genquan.stan
## }

We now compile this model and fit it with our observed data for customer lifetimes.

ltmodel_flat_stanmodel <- cmdstan_model(
  "stan_code/ltmodel_flat.stan",
  include_paths =   stan_codedir,
  pedantic      =           TRUE,
  dir           =  stan_modeldir
  )

We then use this compiled model with our data to produce a fit of the data.

stan_modelname <- "ltmodel_flat"
stanfit_prefix <- str_c("fit_", stan_modelname) 

stan_data_lst <- training_stats_tbl %>%
  select(customer_id, obs_time = T_cal, min_lifetime, max_lifetime) %>%
  compose_data()

ltmodel_flat_stanfit <- ltmodel_flat_stanmodel$sample(
  data            =                 stan_data_lst,
  chains          =                             4,
  iter_warmup     =                           500,
  iter_sampling   =                           500,
  seed            =                          4201,
  output_dir      =                 stan_modeldir,
  output_basename =                stanfit_prefix
  )
## Running MCMC with 4 parallel chains...
## 
## Chain 1 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 2 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 3 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 4 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 1 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 4 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 2 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 3 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 4 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 1 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 3 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 2 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 1 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 4 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 2 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 3 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 1 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 4 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 2 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 3 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 4 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 4 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 1 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 1 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 3 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 3 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 2 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 2 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 4 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 1 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 2 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 3 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 1 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 4 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 2 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 3 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 1 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 4 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 2 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 3 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 1 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 4 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 2 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 3 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 1 finished in 86.5 seconds.
## Chain 4 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 2 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 4 finished in 87.1 seconds.
## Chain 3 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 2 finished in 87.5 seconds.
## Chain 3 finished in 87.8 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 87.2 seconds.
## Total execution time: 88.6 seconds.
ltmodel_flat_stanfit$summary()
## # A tibble: 50,001 × 10
##    variable       mean   median      sd     mad       q5      q95  rhat ess_bulk
##    <chr>         <dbl>    <dbl>   <dbl>   <dbl>    <dbl>    <dbl> <dbl>    <dbl>
##  1 lp__       -3.73e+4 -3.73e+4 6.85e+1 6.88e+1 -3.74e+4 -3.72e+4 1.00      657.
##  2 mu[1]       3.19e-1  2.24e-1 3.07e-1 2.26e-1  2.53e-2  9.40e-1 1.00     2648.
##  3 mu[2]       9.11e-3  7.34e-3 7.19e-3 5.84e-3  1.27e-3  2.33e-2 1.00     2236.
##  4 mu[3]       4.77e-2  3.54e-2 4.18e-2 3.03e-2  5.94e-3  1.31e-1 1.00     2619.
##  5 mu[4]       3.14e-1  2.22e-1 3.07e-1 2.24e-1  2.13e-2  9.19e-1 1.00     2370.
##  6 mu[5]       5.28e-2  4.00e-2 4.52e-2 3.47e-2  6.48e-3  1.39e-1 1.00     2579.
##  7 mu[6]       2.31e-1  1.60e-1 2.24e-1 1.51e-1  2.03e-2  6.84e-1 1.00     2089.
##  8 mu[7]       8.50e-2  6.09e-2 7.78e-2 5.47e-2  9.15e-3  2.38e-1 0.999    2522.
##  9 mu[8]       8.11e-2  6.01e-2 7.38e-2 5.48e-2  9.65e-3  2.23e-1 1.00     2572.
## 10 mu[9]       5.83e-1  4.06e-1 5.94e-1 4.19e-1  3.99e-2  1.74e+0 1.00     1639.
## # … with 49,991 more rows, and 1 more variable: ess_tail <dbl>

We suspect this model has not fit well, but we check the HMC diagnostics anyway.

ltmodel_flat_stanfit$cmdstan_diagnose()
## Processing csv files: /home/rstudio/workshop/stan_models/fit_ltmodel_flat-1.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_flat-2.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_flat-3.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_flat-4.csv
## 
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
## 
## Checking sampler transitions for divergences.
## No divergent transitions found.
## 
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
## 
## Effective sample size satisfactory.
## 
## Split R-hat values satisfactory all parameters.
## 
## Processing complete, no problems detected.

Although the diagnostics report no sampling problems, this model is still unsatisfactory: we have an implicit uniform prior on \(\mu\), which does not match our prior knowledge.

2.2 Fit Fixed-Prior Minimum Time Model

We add a Gamma prior on \(\mu\) to this model, setting its hyperparameters via data passed into the sampler. As we have less confidence in our maximum-lifetime observation, as a first step we fit the model using only the minimum observed lifetime per customer.

## data {
##   int<lower=1> n;                  // count of observations
## 
##   vector<lower=0>[n] min_lifetime; // minimum observed lifetime for customer
##   vector<lower=0>[n] obs_time;     // observation time since customer 'birth'
## 
##   real s;
##   real beta;
## }
## 
## parameters {
##   vector<lower=0>[n] mu;
## }
## 
## model {
##   mu ~ gamma(s, beta);
## 
##   target += exponential_lccdf(min_lifetime | mu);
## }
## 
## generated quantities {
## #include lifetime_genquan.stan
## }

As always, we then compile our Stan model and fit it to the data.

ltmodel_mintime_stanmodel <- cmdstan_model(
  "stan_code/ltmodel_mintime.stan",
  include_paths =   stan_codedir,
  pedantic      =           TRUE,
  dir           =  stan_modeldir
  )

We then use this compiled model with our data to produce a fit of the data.

stan_modelname <- "ltmodel_mintime"
stanfit_prefix <- str_c("fit_", stan_modelname) 

stan_data_lst <- training_stats_tbl %>%
  select(customer_id, obs_time = T_cal, min_lifetime, max_lifetime) %>%
  compose_data(
    s    = 1,
    beta = 10
    )
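Before sampling, it is worth checking what this prior implies. A quick prior-predictive sketch (the simulation is ours; the hyperparameter values are those passed above):

```r
# mu ~ Gamma(shape = s, rate = beta) with s = 1, beta = 10 has prior mean
# s / beta = 0.1 per week, so a typical prior lifetime 1 / mu is on the
# order of 10 weeks.
set.seed(42)
mu_prior  <- rgamma(10000, shape = 1, rate = 10)
tau_prior <- rexp(10000, rate = mu_prior)   # implied prior draws of the lifetime
quantile(tau_prior, c(0.1, 0.5, 0.9))
```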

ltmodel_mintime_stanfit <- ltmodel_mintime_stanmodel$sample(
  data            =                 stan_data_lst,
  chains          =                             4,
  iter_warmup     =                           500,
  iter_sampling   =                           500,
  seed            =                          4202,
  output_dir      =                 stan_modeldir,
  output_basename =                stanfit_prefix
  )
## Running MCMC with 4 parallel chains...
## 
## Chain 1 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 2 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 3 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 4 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 2 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 3 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 1 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 4 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 2 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 3 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 4 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 1 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 2 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 1 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 4 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 3 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 2 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 1 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 4 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 3 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 1 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 2 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 1 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 2 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 4 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 4 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 3 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 3 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 1 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 4 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 2 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 3 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 1 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 4 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 2 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 1 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 4 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 3 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 2 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 1 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 4 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 3 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 1 finished in 84.2 seconds.
## Chain 4 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 4 finished in 84.5 seconds.
## Chain 2 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 3 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 2 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 2 finished in 92.8 seconds.
## Chain 3 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 3 finished in 98.4 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 90.0 seconds.
## Total execution time: 99.8 seconds.
ltmodel_mintime_stanfit$summary()
## # A tibble: 50,001 × 10
##    variable       mean   median      sd     mad       q5      q95  rhat ess_bulk
##    <chr>         <dbl>    <dbl>   <dbl>   <dbl>    <dbl>    <dbl> <dbl>    <dbl>
##  1 lp__       -4.72e+4 -4.72e+4 7.85e+1 7.83e+1 -4.74e+4 -4.71e+4  1.00     725.
##  2 mu[1]       7.53e-2  5.23e-2 7.38e-2 5.19e-2  4.69e-3  2.20e-1  1.00    1782.
##  3 mu[2]       6.67e-3  4.45e-3 6.75e-3 4.72e-3  2.58e-4  2.01e-2  1.00    1989.
##  4 mu[3]       3.03e-2  2.12e-2 3.06e-2 2.19e-2  1.29e-3  8.87e-2  1.00    1923.
##  5 mu[4]       7.79e-2  5.31e-2 7.97e-2 5.49e-2  3.65e-3  2.33e-1  1.00    1335.
##  6 mu[5]       3.12e-2  2.17e-2 3.13e-2 2.21e-2  1.47e-3  9.56e-2  1.00    1640.
##  7 mu[6]       6.59e-2  4.27e-2 6.72e-2 4.55e-2  3.60e-3  2.05e-1  1.00    1525.
##  8 mu[7]       4.44e-2  3.27e-2 4.16e-2 3.25e-2  2.89e-3  1.27e-1  1.00    1730.
##  9 mu[8]       4.20e-2  2.92e-2 4.17e-2 3.03e-2  2.19e-3  1.25e-1  1.00    1747.
## 10 mu[9]       8.48e-2  5.96e-2 8.31e-2 6.00e-2  4.02e-3  2.46e-1  1.00    2024.
## # … with 49,991 more rows, and 1 more variable: ess_tail <dbl>

It appears that this fit is much less problematic, so we check the HMC diagnostics.

ltmodel_mintime_stanfit$cmdstan_diagnose()
## Processing csv files: /home/rstudio/workshop/stan_models/fit_ltmodel_mintime-1.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_mintime-2.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_mintime-3.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_mintime-4.csv
## 
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
## 
## Checking sampler transitions for divergences.
## No divergent transitions found.
## 
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
## 
## Effective sample size satisfactory.
## 
## Split R-hat values satisfactory all parameters.
## 
## Processing complete, no problems detected.

We should also check some of the trace plots:

ltmodel_mintime_stanfit$draws() %>%
  mcmc_trace(
      pars = c("mu[1]", "mu[2]", "mu[3]", "mu[4]", "mu[5]", "mu[6]")
      ) +
    ggtitle("Traceplots of Some Mu Values")

2.2.1 Investigate Posterior Distribution of Customer Lifetime

We now want to visualise the posterior distribution of the expected lifetime of the customer.

ltmodel_mintime_validation_tbl <- ltmodel_mintime_stanfit %>%
  recover_types(training_stats_tbl) %>%
  spread_draws(mu[customer_id], tau[customer_id]) %>%
  ungroup() %>%
  inner_join(customer_simparams_tbl, by = "customer_id") %>%
  select(
    customer_id, first_tnx_date, draw_id = .draw,
    post_mu = mu, post_tau_mean = tau,
    customer_mu, customer_tau
    )

ltmodel_mintime_validation_tbl %>% glimpse()
## Rows: 20,000,000
## Columns: 7
## $ customer_id    <chr> "C201101_0002", "C201101_0002", "C201101_0002", "C20110…
## $ first_tnx_date <date> 2011-01-01, 2011-01-01, 2011-01-01, 2011-01-01, 2011-0…
## $ draw_id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ post_mu        <dbl> 0.19913000, 0.15009000, 0.09672760, 0.05766700, 0.06708…
## $ post_tau_mean  <dbl> 5.02186, 6.66268, 10.33830, 17.34100, 14.90580, 300.392…
## $ customer_mu    <dbl> 0.009023485, 0.009023485, 0.009023485, 0.009023485, 0.0…
## $ customer_tau   <dbl> 17.25275, 17.25275, 17.25275, 17.25275, 17.25275, 17.25…
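The tau quantity produced by the generated quantities block appears to be the conditional expected lifetime for each draw; for an exponential lifetime this is simply the reciprocal of the rate, which we can check against the first draw above:

```r
# For an exponential lifetime with rate mu, E[tau | mu] = 1 / mu; the first
# posterior draw above is consistent with this: 1 / 0.19913 is roughly 5.0219.
post_mu_draw1 <- 0.19913000
1 / post_mu_draw1
```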

We now use our existing routine to compare the posterior values for mu with the predetermined value.

ltmodel_mintime_mu_qvalues_tbl <- ltmodel_mintime_validation_tbl %>%
  calculate_distribution_qvals(post_mu, customer_mu, customer_id, first_tnx_date)
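`calculate_distribution_qvals()` is another project helper; our presumption is that it computes, per customer, the quantile of the known ‘true’ value within the posterior draws. A sketch under that assumption (for a well-calibrated model these q-values should be roughly uniform):

```r
library(dplyr)

# Hypothetical sketch of calculate_distribution_qvals(): the proportion of
# posterior draws at or below the known simulation value, per grouping column.
calculate_distribution_qvals_sketch <- function(data_tbl, post_col, true_col, ...) {
  data_tbl %>%
    group_by(...) %>%
    summarise(
      q_val   = mean({{ post_col }} <= {{ true_col }}),
      .groups = "drop"
      )
}
```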

ref_value <- ltmodel_mintime_mu_qvalues_tbl %>%
  nrow() %>%
  divide_by(50)

ggplot(ltmodel_mintime_mu_qvalues_tbl) +
  geom_histogram(aes(x = q_val), bins = 50) +
  geom_hline(aes(yintercept = ref_value), colour = "red") +
  labs(
    x = "Quantile",
    y = "Count",
    title = "Quantile Plot of the q-Values for the Mu Posterior Distribution (mintime Model)"
    )

ggplot(ltmodel_mintime_mu_qvalues_tbl) +
  geom_point(aes(x = q_val, y = customer_mu), size = 1, alpha = 0.05) +
  labs(
    x = "Quantile",
    y = "Customer Mu",
    title = "Scatterplot of q-Value against Mu (mintime Model)"
    )

We repeat this process for looking at the \(\tau\) value to see the effect on average lifetime.

ltmodel_mintime_tau_qvalues_tbl <- ltmodel_mintime_validation_tbl %>%
  calculate_distribution_qvals(post_tau_mean, customer_tau, customer_id, first_tnx_date)

ref_value <- ltmodel_mintime_tau_qvalues_tbl %>%
  nrow() %>%
  divide_by(50)

ggplot(ltmodel_mintime_tau_qvalues_tbl) +
  geom_histogram(aes(x = q_val), bins = 50) +
  geom_hline(aes(yintercept = ref_value), colour = "red") +
  labs(
    x = "Quantile",
    y = "Count",
    title = "Quantile Plot of the q-Values for the Tau Posterior Distribution (mintime Model)"
    )

ggplot(ltmodel_mintime_tau_qvalues_tbl) +
  geom_point(aes(x = q_val, y = customer_tau)) +
  labs(
    x = "Quantile",
    y = "Customer Tau",
    title = "Scatterplot of q-Value against Tau (mintime Model)"
    )

It would appear that recovering the parameters for the lifetime model is more difficult than for the frequency model. This is not surprising: our measurement of customer lifetime is much noisier, since all we can observe is a lower bound on it, so it follows that our estimates come with much wider uncertainty intervals.

Our issue is not just the precision of the estimates, though - we also seem to have a problem with bias, introduced by the noisiness of our lifetime observations.

This needs much further investigation, so we calculate some summary statistics for these parameters and plot them against the known ‘true’ values of these quantities.

ltmodel_mintime_postsummary_tbl <- ltmodel_mintime_validation_tbl %>%
  group_by(customer_id) %>%
  summarise(
    .groups = "drop",
    
    mu_p10       = quantile(post_mu, 0.10),
    mu_p25       = quantile(post_mu, 0.25),
    mu_median    = median  (post_mu),
    mu_mean      = mean    (post_mu),
    mu_p75       = quantile(post_mu, 0.75),
    mu_p90       = quantile(post_mu, 0.90),
    
    tau_p10      = quantile(post_tau_mean, 0.10),
    tau_p25      = quantile(post_tau_mean, 0.25),
    tau_median   = median  (post_tau_mean),
    tau_mean     = mean    (post_tau_mean),
    tau_p75      = quantile(post_tau_mean, 0.75),
    tau_p90      = quantile(post_tau_mean, 0.90),
    
    customer_mu  = unique(customer_mu),
    customer_tau = unique(customer_tau)
    )

ltmodel_mintime_postsummary_tbl %>% glimpse()
## Rows: 10,000
## Columns: 15
## $ customer_id  <chr> "C201101_0002", "C201101_0004", "C201101_0006", "C201101_…
## $ mu_p10       <dbl> 0.0092106810, 0.0006478439, 0.0030852110, 0.0078394130, 0…
## $ mu_p25       <dbl> 0.023857900, 0.001869510, 0.008761672, 0.022462875, 0.009…
## $ mu_median    <dbl> 0.052294450, 0.004446245, 0.021176750, 0.053145600, 0.021…
## $ mu_mean      <dbl> 0.075263001, 0.006667289, 0.030261938, 0.077919382, 0.031…
## $ mu_p75       <dbl> 0.102917750, 0.009228000, 0.042662625, 0.108517500, 0.042…
## $ mu_p90       <dbl> 0.168895000, 0.015422570, 0.068355390, 0.177157300, 0.072…
## $ tau_p10      <dbl> 5.920841, 64.840050, 14.629450, 5.644710, 13.827150, 6.41…
## $ tau_p25      <dbl> 9.716540, 108.365750, 23.439700, 9.215087, 23.797925, 10.…
## $ tau_median   <dbl> 19.12250, 224.90900, 47.22160, 18.81625, 46.00295, 23.434…
## $ tau_mean     <dbl> 75.51625, 1070.87630, 1440.37341, 394.15380, 295.87336, 8…
## $ tau_p75      <dbl> 41.91505, 534.90125, 114.13375, 44.51793, 108.24575, 53.6…
## $ tau_p90      <dbl> 108.5700, 1543.5870, 324.1269, 127.5604, 305.6213, 139.33…
## $ customer_mu  <dbl> 0.009023485, 0.016847819, 0.048667792, 0.141250317, 0.058…
## $ customer_tau <dbl> 17.252751, 155.215456, 28.310863, 9.680664, 36.956593, 4.…

We can now visualise these summary statistics.

plot_tbl <- ltmodel_mintime_postsummary_tbl %>%
  slice_sample(n = 25) %>%
  arrange(customer_id)

ggplot(plot_tbl) +
  geom_errorbar(
    aes(x = customer_id, ymin = mu_p10, ymax = mu_p90),
    width = 0, size = 1) +
  geom_errorbar(
    aes(x = customer_id, ymin = mu_p25, ymax = mu_p75),
    width = 0, size = 4) +
  geom_point(
    aes(x = customer_id, y = customer_mu),
    colour = "red", size = 2.5) +
  labs(
    x = "Customer ID",
    y = "Posterior Mu",
    title = "Summary Statistics of Mu Values"
    ) +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))

ggplot(plot_tbl) +
  geom_errorbar(
    aes(x = customer_id, ymin = tau_p10, ymax = tau_p90),
    width = 0, size = 1) +
  geom_errorbar(
    aes(x = customer_id, ymin = tau_p25, ymax = tau_p75),
    width = 0, size = 4) +
  geom_point(
    aes(x = customer_id, y = customer_tau),
    colour = "red", size = 2.5) +
  scale_y_log10(labels = label_comma()) +
  labs(
    x = "Customer ID",
    y = "Posterior Tau",
    title = "Summary Statistics of Tau Values"
    ) +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))

2.2.2 Investigate q_value Cohorts

We produced a scatterplot of customer_mu against q_val above and it does not look right: the q-values should be independent of the value of customer_mu, so we need to investigate further.

To start with, we look at the lower end of the q_val distribution.

plot_tbl <- ltmodel_mintime_mu_qvalues_tbl %>%
  filter(q_val < 0.1) %>%
  select(customer_id) %>%
  inner_join(ltmodel_mintime_postsummary_tbl, by = "customer_id")

ggplot(plot_tbl) +
  geom_errorbar(
    aes(x = customer_id, ymin = mu_p10, ymax = mu_p90),
    size = 1, width = 0) +
  geom_errorbar(
    aes(x = customer_id, ymin = mu_p25, ymax = mu_p75),
    size = 3, width = 0) +
  geom_point(
    aes(x = customer_id, y = customer_mu),
    colour = "red") +
  labs(
    x = "Customer ID",
    y = "Lifetime Mu",
    title = "Posterior Estimates of Small q-Value Customers") +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))

We have been looking at customer_mu, but it may be more informative to look at customer_tau instead.

plot_tbl <- ltmodel_mintime_mu_qvalues_tbl %>%
  filter(q_val < 0.1) %>%
  select(customer_id) %>%
  inner_join(ltmodel_mintime_postsummary_tbl, by = "customer_id")

ggplot(plot_tbl) +
  geom_errorbar(
    aes(x = customer_id, ymin = tau_p10, ymax = tau_p90),
    size = 1, width = 0) +
  geom_errorbar(
    aes(x = customer_id, ymin = tau_p25, ymax = tau_p75),
    size = 3, width = 0) +
  geom_point(
    aes(x = customer_id, y = customer_tau),
    colour = "red") +
  labs(
    x = "Customer ID",
    y = "Lifetime Tau",
    title = "Posterior Estimates of Small q-Value Customers") +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))

A few outliers are skewing this plot, so we put the \(y\)-axis on a log scale.

ggplot(plot_tbl) +
  geom_errorbar(
    aes(x = customer_id, ymin = tau_p10, ymax = tau_p90),
    size = 1, width = 0) +
  geom_errorbar(
    aes(x = customer_id, ymin = tau_p25, ymax = tau_p75),
    size = 3, width = 0) +
  geom_point(
    aes(x = customer_id, y = customer_tau),
    colour = "red") +
  scale_y_log10(labels = label_comma()) +
  labs(
    x = "Customer ID",
    y = "Lifetime Tau",
    title = "Posterior Estimates of Small q-Value Customers") +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))

It seems that having only censored observations of the lifetime impairs the model’s ability to recover its parameters.

We continue for now though, and see if additions to the model can improve things.

2.3 Fit Fixed-Prior Min/Max Time Model

It is clear that we need to improve our approach to estimating customer lifetimes, so we try including the maximum lifetime bound in the model.

## data {
##   int<lower=1> n;                  // count of observations
## 
##   vector<lower=0>[n] min_lifetime; // minimum observed lifetime for customer
##   vector<lower=0>[n] max_lifetime; // maximum observed lifetime for customer
##   vector<lower=0>[n] obs_time;     // observation time since customer 'birth'
## 
##   real s;
##   real beta;
## }
## 
## parameters {
##   vector<lower=0>[n] mu;
## }
## 
## model {
##   mu ~ gamma(s, beta);
## 
##   target += exponential_lccdf(min_lifetime | mu);
##   target += exponential_lcdf (max_lifetime | mu);
## }
## 
## generated quantities {
## #include lifetime_genquan.stan
## }
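
The two target increments encode interval information about each lifetime: the customer survived at least min_lifetime and had churned by max_lifetime. A minimal R sketch of the same two log-probability terms, using base R's pexp() for the exponential CDF (the function name and example values are illustrative):

```r
# Sketch of the two log-probability terms added in the Stan model block:
# exponential_lccdf(min_lifetime | mu) + exponential_lcdf(max_lifetime | mu).
minmax_target <- function(mu, min_lifetime, max_lifetime) {
  pexp(min_lifetime, rate = mu, lower.tail = FALSE, log.p = TRUE) +  # lccdf
    pexp(max_lifetime, rate = mu, lower.tail = TRUE, log.p = TRUE)   # lcdf
}

minmax_target(mu = 0.1, min_lifetime = 5, max_lifetime = 20)
```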

As always, we then compile our Stan model before fitting it to the data.

ltmodel_minmax_stanmodel <- cmdstan_model(
  "stan_code/ltmodel_minmax.stan",
  include_paths =   stan_codedir,
  pedantic      =           TRUE,
  dir           =  stan_modeldir
  )

We then fit this new model to the same data as before.

stan_modelname <- "ltmodel_minmax"
stanfit_prefix <- str_c("fit_", stan_modelname) 

stan_data_lst <- training_stats_tbl %>%
  select(customer_id, obs_time = T_cal, min_lifetime, max_lifetime) %>%
  compose_data(
    s    =  1,
    beta = 10
    )

ltmodel_minmax_stanfit <- ltmodel_minmax_stanmodel$sample(
  data            =                 stan_data_lst,
  chains          =                             4,
  iter_warmup     =                           500,
  iter_sampling   =                           500,
  seed            =                          4203,
  output_dir      =                 stan_modeldir,
  output_basename =                stanfit_prefix
  )
## Running MCMC with 4 parallel chains...
## 
## Chain 1 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 2 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 3 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 4 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 1 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 2 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 3 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 4 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 2 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 1 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 3 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 4 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 2 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 3 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 4 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 1 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 2 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 3 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 4 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 1 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 2 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 2 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 3 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 3 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 4 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 4 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 1 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 1 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 2 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 3 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 4 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 1 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 2 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 4 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 3 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 1 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 2 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 3 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 4 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 1 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 3 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 4 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 2 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 1 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 3 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 4 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 2 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 3 finished in 86.8 seconds.
## Chain 4 finished in 86.8 seconds.
## Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 2 finished in 87.5 seconds.
## Chain 1 finished in 88.8 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 87.5 seconds.
## Total execution time: 89.8 seconds.
ltmodel_minmax_stanfit$summary()
## # A tibble: 50,001 × 10
##    variable       mean   median      sd     mad       q5      q95  rhat ess_bulk
##    <chr>         <dbl>    <dbl>   <dbl>   <dbl>    <dbl>    <dbl> <dbl>    <dbl>
##  1 lp__       -4.73e+4 -4.73e+4 7.37e+1 7.15e+1 -4.74e+4 -4.72e+4 1.00      750.
##  2 mu[1]       8.39e-2  6.12e-2 7.55e-2 5.24e-2  9.48e-3  2.37e-1 0.999    2591.
##  3 mu[2]       8.59e-3  6.90e-3 6.41e-3 5.09e-3  1.52e-3  2.10e-2 1.00     3156.
##  4 mu[3]       3.69e-2  2.82e-2 3.18e-2 2.25e-2  4.86e-3  1.03e-1 1.00     3906.
##  5 mu[4]       8.18e-2  5.91e-2 7.26e-2 5.36e-2  9.63e-3  2.32e-1 1.00     3202.
##  6 mu[5]       3.72e-2  2.87e-2 3.00e-2 2.35e-2  6.00e-3  9.68e-2 1.00     2625.
##  7 mu[6]       7.80e-2  5.69e-2 6.89e-2 5.15e-2  9.15e-3  2.13e-1 0.999    3407.
##  8 mu[7]       5.35e-2  3.85e-2 4.82e-2 3.33e-2  6.85e-3  1.52e-1 1.00     2795.
##  9 mu[8]       5.03e-2  3.73e-2 4.40e-2 3.28e-2  6.18e-3  1.34e-1 0.999    2811.
## 10 mu[9]       9.59e-2  6.77e-2 8.82e-2 6.11e-2  1.10e-2  2.75e-1 0.999    2629.
## # … with 49,991 more rows, and 1 more variable: ess_tail <dbl>

As before, we need to check the HMC diagnostics.

ltmodel_minmax_stanfit$cmdstan_diagnose()
## Processing csv files: /home/rstudio/workshop/stan_models/fit_ltmodel_minmax-1.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_minmax-2.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_minmax-3.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_minmax-4.csv
## 
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
## 
## Checking sampler transitions for divergences.
## No divergent transitions found.
## 
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
## 
## Effective sample size satisfactory.
## 
## Split R-hat values satisfactory all parameters.
## 
## Processing complete, no problems detected.

2.3.1 Plot Customer Lifetime Diagnostic Plots

We now want to visualise the posterior distribution of the expected lifetime of the customer.

ltmodel_minmax_validation_tbl <- ltmodel_minmax_stanfit %>%
  recover_types(training_stats_tbl) %>%
  spread_draws(mu[customer_id], tau[customer_id]) %>%
  ungroup() %>%
  inner_join(customer_simparams_tbl, by = "customer_id") %>%
  select(
    customer_id, first_tnx_date, draw_id = .draw,
    post_mu = mu, post_tau_mean = tau,
    customer_mu, customer_tau
    )

ltmodel_minmax_validation_tbl %>% glimpse()
## Rows: 20,000,000
## Columns: 7
## $ customer_id    <chr> "C201101_0002", "C201101_0002", "C201101_0002", "C20110…
## $ first_tnx_date <date> 2011-01-01, 2011-01-01, 2011-01-01, 2011-01-01, 2011-0…
## $ draw_id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ post_mu        <dbl> 0.47066600, 0.07995400, 0.07683270, 0.08568460, 0.01306…
## $ post_tau_mean  <dbl> 2.12465, 12.50720, 13.01530, 11.67070, 76.52430, 4.1747…
## $ customer_mu    <dbl> 0.009023485, 0.009023485, 0.009023485, 0.009023485, 0.0…
## $ customer_tau   <dbl> 17.25275, 17.25275, 17.25275, 17.25275, 17.25275, 17.25…
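
Note that post_tau_mean is the expected lifetime implied by each posterior draw of mu; for an exponential lifetime this is simply \(1/\mu\), consistent with the first row of the glimpse above:

```r
# For an Exponential(mu) lifetime the mean is 1 / mu, so each draw's implied
# expected lifetime can be checked directly against post_tau_mean.
post_mu <- 0.47066600   # first draw in the glimpse above
1 / post_mu             # matches post_tau_mean of 2.12465
```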

We now use our existing routine to compare the posterior values for mu with the predetermined value.

ltmodel_minmax_mu_qvalues_tbl <- ltmodel_minmax_validation_tbl %>%
  calculate_distribution_qvals(post_mu, customer_mu, customer_id, first_tnx_date)
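
The helper calculate_distribution_qvals comes from earlier in the workbook; conceptually, the \(q\)-value for a customer is the quantile of the known simulation value within that customer's posterior draws, so a well-calibrated posterior yields roughly uniform \(q\)-values. A hedged sketch of that idea (the function name and simulated draws below are hypothetical):

```r
# Hypothetical sketch of the q-value idea: the proportion of posterior draws
# falling below the known true value. If the posterior is well calibrated,
# q-values across customers should be approximately Uniform(0, 1).
qval_sketch <- function(draws, true_value) mean(draws < true_value)

set.seed(42)
qval_sketch(rnorm(2000, mean = 0.05, sd = 0.01), true_value = 0.05)
```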

ref_value <- ltmodel_minmax_mu_qvalues_tbl %>%
  nrow() %>%
  divide_by(50)

ggplot(ltmodel_minmax_mu_qvalues_tbl) +
  geom_histogram(aes(x = q_val), bins = 50) +
  geom_hline(aes(yintercept = ref_value), colour = "red") +
  labs(
    x = "Quantile",
    y = "Count",
    title = "Quantile Plot of the q-Values for the Posterior Distribution"
    )

ggplot(ltmodel_minmax_mu_qvalues_tbl) +
  geom_point(aes(x = q_val, y = customer_mu)) +
  labs(
    x = "Quantile",
    y = "Customer Mu",
    title = "Scatterplot of q-Value against Mu"
    )

We repeat this process for the \(\tau\) value to see the effect on the average lifetime.

ltmodel_minmax_tau_qvalues_tbl <- ltmodel_minmax_validation_tbl %>%
  calculate_distribution_qvals(post_tau_mean, customer_tau, customer_id, first_tnx_date)

ref_value <- ltmodel_minmax_tau_qvalues_tbl %>%
  nrow() %>%
  divide_by(50)

ggplot(ltmodel_minmax_tau_qvalues_tbl) +
  geom_histogram(aes(x = q_val), bins = 50) +
  geom_hline(aes(yintercept = ref_value), colour = "red") +
  labs(
    x = "Quantile",
    y = "Count",
    title = "Quantile Plot of the q-Values for the Tau Posterior Distribution (minmax Model)"
    )

ggplot(ltmodel_minmax_tau_qvalues_tbl) +
  geom_point(aes(x = q_val, y = customer_tau)) +
  labs(
    x = "Quantile",
    y = "Customer Tau",
    title = "Scatterplot of q-Value against Tau (minmax Model)"
    )

2.3.2 Calculate minmax Model Posterior Summary Statistics

We now calculate summary statistics for the posterior quantities.

ltmodel_minmax_postsummary_tbl <- ltmodel_minmax_validation_tbl %>%
  group_by(customer_id) %>%
  summarise(
    .groups = "drop",
    
    mu_p10       = quantile(post_mu, 0.10),
    mu_p25       = quantile(post_mu, 0.25),
    mu_median    = median  (post_mu),
    mu_mean      = mean    (post_mu),
    mu_p75       = quantile(post_mu, 0.75),
    mu_p90       = quantile(post_mu, 0.90),
    
    tau_p10      = quantile(post_tau_mean, 0.10),
    tau_p25      = quantile(post_tau_mean, 0.25),
    tau_median   = median  (post_tau_mean),
    tau_mean     = mean    (post_tau_mean),
    tau_p75      = quantile(post_tau_mean, 0.75),
    tau_p90      = quantile(post_tau_mean, 0.90),
    
    customer_mu  = unique(customer_mu),
    customer_tau = unique(customer_tau)
    )

ltmodel_minmax_postsummary_tbl %>% glimpse()
## Rows: 10,000
## Columns: 15
## $ customer_id  <chr> "C201101_0002", "C201101_0004", "C201101_0006", "C201101_…
## $ mu_p10       <dbl> 0.015611020, 0.002247156, 0.007969247, 0.015621790, 0.008…
## $ mu_p25       <dbl> 0.030674475, 0.003973943, 0.015086350, 0.029101200, 0.015…
## $ mu_median    <dbl> 0.061161550, 0.006903310, 0.028158650, 0.059064800, 0.028…
## $ mu_mean      <dbl> 0.083906509, 0.008591510, 0.036924538, 0.081846056, 0.037…
## $ mu_p75       <dbl> 0.11321400, 0.01155767, 0.04844448, 0.11305950, 0.0511908…
## $ mu_p90       <dbl> 0.17922390, 0.01706617, 0.07777854, 0.17814330, 0.0761281…
## $ tau_p10      <dbl> 5.579605, 58.595380, 12.857000, 5.613448, 13.135870, 5.75…
## $ tau_p25      <dbl> 8.832825, 86.522675, 20.642150, 8.844893, 19.534750, 9.23…
## $ tau_median   <dbl> 16.35015, 144.85800, 35.51305, 16.93060, 34.84010, 17.572…
## $ tau_mean     <dbl> 34.47874, 223.04430, 62.52966, 31.97575, 63.05222, 33.679…
## $ tau_p75      <dbl> 32.60040, 251.63950, 66.28508, 34.36285, 64.33345, 35.164…
## $ tau_p90      <dbl> 64.05778, 445.00800, 125.48650, 64.01318, 119.72810, 70.3…
## $ customer_mu  <dbl> 0.009023485, 0.016847819, 0.048667792, 0.141250317, 0.058…
## $ customer_tau <dbl> 17.252751, 155.215456, 28.310863, 9.680664, 36.956593, 4.…

We can now visualise these summary statistics.

plot_tbl <- ltmodel_minmax_postsummary_tbl %>%
  slice_sample(n = 25) %>%
  arrange(customer_id)

ggplot(plot_tbl) +
  geom_errorbar(
    aes(x = customer_id, ymin = mu_p10, ymax = mu_p90),
    width = 0, size = 1) +
  geom_errorbar(
    aes(x = customer_id, ymin = mu_p25, ymax = mu_p75),
    width = 0, size = 4) +
  geom_point(
    aes(x = customer_id, y = customer_mu),
    colour = "red", size = 2.5) +
  labs(
    x = "Customer ID",
    y = "Posterior Mu",
    title = "Summary Statistics of Mu Values"
    ) +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))

ggplot(plot_tbl) +
  geom_errorbar(
    aes(x = customer_id, ymin = tau_p10, ymax = tau_p90),
    width = 0, size = 1) +
  geom_errorbar(
    aes(x = customer_id, ymin = tau_p25, ymax = tau_p75),
    width = 0, size = 4) +
  geom_point(
    aes(x = customer_id, y = customer_tau),
    colour = "red", size = 2.5) +
  scale_y_log10(labels = label_comma()) +
  labs(
    x = "Customer ID",
    y = "Posterior Tau",
    title = "Summary Statistics of Tau Values"
    ) +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))

2.3.3 Investigate Extreme q-Value

As we have done with our “Minimum Time” model, we look at the data where the \(q\) value is very low.

plot_tbl <- ltmodel_minmax_mu_qvalues_tbl %>%
  filter(q_val < 0.05) %>%
  select(customer_id) %>%
  inner_join(ltmodel_minmax_postsummary_tbl, by = "customer_id")

ggplot(plot_tbl) +
  geom_errorbar(
    aes(x = customer_id, ymin = mu_p10, ymax = mu_p90),
    size = 1, width = 0) +
  geom_errorbar(
    aes(x = customer_id, ymin = mu_p25, ymax = mu_p75),
    size = 3, width = 0) +
  geom_point(
    aes(x = customer_id, y = customer_mu),
    colour = "red") +
  labs(
    x = "Customer ID",
    y = "Lifetime Mu",
    title = "Posterior Estimates of Small q-Value Customers") +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))

We have been looking at customer_mu here, but it may be more informative to look at customer_tau instead.

plot_tbl <- ltmodel_minmax_mu_qvalues_tbl %>%
  filter(q_val < 0.05) %>%
  select(customer_id) %>%
  inner_join(ltmodel_minmax_postsummary_tbl, by = "customer_id")

ggplot(plot_tbl) +
  geom_errorbar(
    aes(x = customer_id, ymin = tau_p10, ymax = tau_p90),
    size = 1, width = 0) +
  geom_errorbar(
    aes(x = customer_id, ymin = tau_p25, ymax = tau_p75),
    size = 3, width = 0) +
  geom_point(
    aes(x = customer_id, y = customer_tau),
    colour = "red") +
  labs(
    x = "Customer ID",
    y = "Lifetime Tau",
    title = "Posterior Estimates of Small q-Value Customers") +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))

2.4 Fit Tighter Fixed-Prior Min/Max Time Model

We now fit the same model but with a tighter prior around \(\mu\).
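
Both the original Gamma(1, 10) prior and the new Gamma(5, 50) prior have mean \(s / \beta = 0.1\), but the new one has a smaller standard deviation \(\sqrt{s} / \beta\). A quick check:

```r
# A Gamma(s, beta) prior has mean s / beta and sd sqrt(s) / beta, so both
# priors centre mu on 0.1 but the second concentrates more mass around it.
gamma_prior_stats <- function(s, beta) c(mean = s / beta, sd = sqrt(s) / beta)

gamma_prior_stats(s = 1, beta = 10)   # original prior: mean 0.1, sd 0.1
gamma_prior_stats(s = 5, beta = 50)   # tighter prior:  mean 0.1, sd ~0.045
```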

stan_modelname <- "ltmodel_tightprior"
stanfit_prefix <- str_c("fit_", stan_modelname)

stan_data_lst <- training_stats_tbl %>%
  select(customer_id, obs_time = T_cal, min_lifetime, max_lifetime) %>%
  compose_data(
    s    =  5,
    beta = 50
    )

ltmodel_tightprior_stanfit <- ltmodel_minmax_stanmodel$sample(
  data            =                 stan_data_lst,
  chains          =                             4,
  iter_warmup     =                           500,
  iter_sampling   =                           500,
  seed            =                          4204,
  output_dir      =                 stan_modeldir,
  output_basename =                stanfit_prefix
  )
## Running MCMC with 4 parallel chains...
## 
## Chain 1 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 2 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 3 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 4 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 2 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 1 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 3 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 4 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 2 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 1 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 3 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 4 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 2 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 3 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 4 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 1 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 2 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 4 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 3 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 1 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 2 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 2 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 4 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 4 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 3 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 3 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 1 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 1 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 2 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 4 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 3 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 1 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 2 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 4 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 3 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 1 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 2 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 4 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 1 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 3 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 2 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 4 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 1 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 3 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 4 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 2 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 4 finished in 74.2 seconds.
## Chain 2 finished in 75.1 seconds.
## Chain 3 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 3 finished in 75.1 seconds.
## Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 1 finished in 75.8 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 75.1 seconds.
## Total execution time: 76.4 seconds.
ltmodel_tightprior_stanfit$summary()
## # A tibble: 50,001 × 10
##    variable       mean   median      sd     mad       q5      q95  rhat ess_bulk
##    <chr>         <dbl>    <dbl>   <dbl>   <dbl>    <dbl>    <dbl> <dbl>    <dbl>
##  1 lp__       -1.85e+5 -1.85e+5 72.5    69.7    -1.85e+5 -1.85e+5 1.00      724.
##  2 mu[1]       9.47e-2  8.88e-2  0.0412  0.0400  3.83e-2  1.70e-1 1.00     3611.
##  3 mu[2]       2.53e-2  2.36e-2  0.0112  0.0106  1.03e-2  4.70e-2 1.00     3683.
##  4 mu[3]       6.90e-2  6.36e-2  0.0322  0.0293  2.68e-2  1.30e-1 1.00     3185.
##  5 mu[4]       9.46e-2  8.80e-2  0.0410  0.0382  3.90e-2  1.72e-1 1.00     3052.
##  6 mu[5]       7.00e-2  6.42e-2  0.0320  0.0306  2.70e-2  1.28e-1 1.00     2796.
##  7 mu[6]       9.17e-2  8.56e-2  0.0422  0.0398  3.40e-2  1.70e-1 1.00     3720.
##  8 mu[7]       7.94e-2  7.43e-2  0.0347  0.0324  3.19e-2  1.44e-1 1.00     3477.
##  9 mu[8]       7.84e-2  7.30e-2  0.0345  0.0327  3.09e-2  1.43e-1 1.00     3563.
## 10 mu[9]       9.54e-2  9.05e-2  0.0405  0.0380  3.78e-2  1.68e-1 0.999    2580.
## # … with 49,991 more rows, and 1 more variable: ess_tail <dbl>

As before, we need to check the HMC diagnostics.

ltmodel_tightprior_stanfit$cmdstan_diagnose()
## Processing csv files: /home/rstudio/workshop/stan_models/fit_ltmodel_minmax-1.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_minmax-2.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_minmax-3.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_minmax-4.csv
## 
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
## 
## Checking sampler transitions for divergences.
## No divergent transitions found.
## 
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
## 
## Effective sample size satisfactory.
## 
## Split R-hat values satisfactory all parameters.
## 
## Processing complete, no problems detected.

2.4.1 Plot Customer Lifetime Diagnostic Plots

Once again we construct validation plots for our data - comparing the posterior distributions against the known customer values.

ltmodel_tightprior_validation_tbl <- ltmodel_tightprior_stanfit %>%
  recover_types(training_stats_tbl) %>%
  spread_draws(mu[customer_id], tau[customer_id]) %>%
  ungroup() %>%
  inner_join(customer_simparams_tbl, by = "customer_id") %>%
  select(
    customer_id, first_tnx_date, draw_id = .draw,
    post_mu = mu, post_tau_mean = tau,
    customer_mu, customer_tau
    )

ltmodel_tightprior_validation_tbl %>% glimpse()
## Rows: 20,000,000
## Columns: 7
## $ customer_id    <chr> "C201101_0002", "C201101_0002", "C201101_0002", "C20110…
## $ first_tnx_date <date> 2011-01-01, 2011-01-01, 2011-01-01, 2011-01-01, 2011-0…
## $ draw_id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ post_mu        <dbl> 0.1100270, 0.1016490, 0.0418283, 0.1075610, 0.1600910, …
## $ post_tau_mean  <dbl> 9.08870, 9.83780, 23.90730, 9.29704, 6.24645, 9.09592, …
## $ customer_mu    <dbl> 0.009023485, 0.009023485, 0.009023485, 0.009023485, 0.0…
## $ customer_tau   <dbl> 17.25275, 17.25275, 17.25275, 17.25275, 17.25275, 17.25…

We now use our existing routine to compare the posterior values for mu with the predetermined value.

ltmodel_tightprior_mu_qvalues_tbl <- ltmodel_tightprior_validation_tbl %>%
  calculate_distribution_qvals(post_mu, customer_mu, customer_id, first_tnx_date)

ref_value <- ltmodel_tightprior_mu_qvalues_tbl %>%
  nrow() %>%
  divide_by(50)

ggplot(ltmodel_tightprior_mu_qvalues_tbl) +
  geom_histogram(aes(x = q_val), bins = 50) +
  geom_hline(aes(yintercept = ref_value), colour = "red") +
  labs(
    x = "Quantile",
    y = "Count",
    title = "Quantile Plot of the q-Values for the Posterior Distribution"
    )

ggplot(ltmodel_tightprior_mu_qvalues_tbl) +
  geom_point(aes(x = q_val, y = customer_mu)) +
  labs(
    x = "Quantile",
    y = "Customer Mu",
    title = "Scatterplot of q-Value against Mu"
    )

2.4.2 Calculate tightprior Model Posterior Summary Statistics

We now calculate summary statistics for the posterior quantities.

ltmodel_tightprior_postsummary_tbl <- ltmodel_tightprior_validation_tbl %>%
  group_by(customer_id) %>%
  summarise(
    .groups = "drop",
    
    mu_p10       = quantile(post_mu, 0.10),
    mu_p25       = quantile(post_mu, 0.25),
    mu_median    = median  (post_mu),
    mu_mean      = mean    (post_mu),
    mu_p75       = quantile(post_mu, 0.75),
    mu_p90       = quantile(post_mu, 0.90),
    
    tau_p10      = quantile(post_tau_mean, 0.10),
    tau_p25      = quantile(post_tau_mean, 0.25),
    tau_median   = median  (post_tau_mean),
    tau_mean     = mean    (post_tau_mean),
    tau_p75      = quantile(post_tau_mean, 0.75),
    tau_p90      = quantile(post_tau_mean, 0.90),
    
    customer_mu  = unique(customer_mu),
    customer_tau = unique(customer_tau)
    )

ltmodel_tightprior_postsummary_tbl %>% glimpse()
## Rows: 10,000
## Columns: 15
## $ customer_id  <chr> "C201101_0002", "C201101_0004", "C201101_0006", "C201101_…
## $ mu_p10       <dbl> 0.046494800, 0.012536390, 0.033793800, 0.048006760, 0.033…
## $ mu_p25       <dbl> 0.06424620, 0.01683717, 0.04572955, 0.06492627, 0.0465805…
## $ mu_median    <dbl> 0.08876680, 0.02364685, 0.06358135, 0.08796385, 0.0641517…
## $ mu_mean      <dbl> 0.09469945, 0.02527723, 0.06897114, 0.09458880, 0.0699766…
## $ mu_p75       <dbl> 0.11905750, 0.03158462, 0.08611853, 0.11793100, 0.0888050…
## $ mu_p90       <dbl> 0.14998180, 0.04081577, 0.11100260, 0.15017330, 0.1130137…
## $ tau_p10      <dbl> 6.667497, 24.500340, 9.008766, 6.658982, 8.848469, 6.7340…
## $ tau_p25      <dbl> 8.399290, 31.660975, 11.611900, 8.479540, 11.260650, 8.68…
## $ tau_median   <dbl> 11.26545, 42.28900, 15.72790, 11.36830, 15.58805, 11.6876…
## $ tau_mean     <dbl> 13.04351, 48.91947, 18.29236, 13.02305, 17.93547, 13.8739…
## $ tau_p75      <dbl> 15.56510, 59.39232, 21.86772, 15.40207, 21.46817, 16.3357…
## $ tau_p90      <dbl> 21.50777, 79.76759, 29.59126, 20.83043, 29.59817, 23.2381…
## $ customer_mu  <dbl> 0.009023485, 0.016847819, 0.048667792, 0.141250317, 0.058…
## $ customer_tau <dbl> 17.252751, 155.215456, 28.310863, 9.680664, 36.956593, 4.…

We can now visualise these summary statistics.

plot_tbl <- ltmodel_tightprior_postsummary_tbl %>%
  slice_sample(n = 25) %>%
  arrange(customer_id)

ggplot(plot_tbl) +
  geom_errorbar(
    aes(x = customer_id, ymin = mu_p10, ymax = mu_p90),
    width = 0, size = 1) +
  geom_errorbar(
    aes(x = customer_id, ymin = mu_p25, ymax = mu_p75),
    width = 0, size = 4) +
  geom_point(
    aes(x = customer_id, y = customer_mu),
    colour = "red", size = 2.5) +
  labs(
    x = "Customer ID",
    y = "Posterior Mu",
    title = "Summary Statistics of Mu Values"
    ) +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))

ggplot(plot_tbl) +
  geom_errorbar(
    aes(x = customer_id, ymin = tau_p10, ymax = tau_p90),
    width = 0, size = 1) +
  geom_errorbar(
    aes(x = customer_id, ymin = tau_p25, ymax = tau_p75),
    width = 0, size = 4) +
  geom_point(
    aes(x = customer_id, y = customer_tau),
    colour = "red", size = 2.5) +
  scale_y_log10(labels = label_comma()) +
  labs(
    x = "Customer ID",
    y = "Posterior Tau",
    title = "Summary Statistics of Tau Values"
    ) +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))

2.4.3 Investigate Extreme q-Value

As we did with our previous models, we look at the data where the \(q\) value is very low.

plot_tbl <- ltmodel_tightprior_mu_qvalues_tbl %>%
  filter(q_val < 0.025) %>%
  select(customer_id) %>%
  inner_join(ltmodel_tightprior_postsummary_tbl, by = "customer_id")

ggplot(plot_tbl) +
  geom_errorbar(
    aes(x = customer_id, ymin = mu_p10, ymax = mu_p90),
    size = 1, width = 0) +
  geom_errorbar(
    aes(x = customer_id, ymin = mu_p25, ymax = mu_p75),
    size = 3, width = 0) +
  geom_point(
    aes(x = customer_id, y = customer_mu),
    colour = "red") +
  labs(
    x = "Customer ID",
    y = "Lifetime Mu",
    title = "Posterior Estimates of Small q-Value Customers") +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))

We have been looking at customer_mu here, but it may be more informative to look at customer_tau instead.

plot_tbl <- ltmodel_tightprior_mu_qvalues_tbl %>%
  filter(q_val < 0.025) %>%
  select(customer_id) %>%
  inner_join(ltmodel_tightprior_postsummary_tbl, by = "customer_id")

ggplot(plot_tbl) +
  geom_errorbar(
    aes(x = customer_id, ymin = tau_p10, ymax = tau_p90),
    size = 1, width = 0) +
  geom_errorbar(
    aes(x = customer_id, ymin = tau_p25, ymax = tau_p75),
    size = 3, width = 0) +
  geom_point(
    aes(x = customer_id, y = customer_tau),
    colour = "red") +
  labs(
    x = "Customer ID",
    y = "Lifetime Tau",
    title = "Posterior Estimates of Small q-Value Customers") +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))

3 Construct the Hierarchical Lifetime Model

We now look at adding uncertainty to our priors, as we did with the frequency model.

## data {
##   int<lower=1> n;               // count of observations
## 
##   vector<lower=0>[n] min_lifetime; // minimum observed lifetime for customer
##   vector<lower=0>[n] max_lifetime; // maximum observed lifetime for customer
##   vector<lower=0>[n] obs_time;     // observation time since customer 'birth'
## 
##   real<lower=0> mean_p1, mean_p2;
##   real<lower=0> cov_p1, cov_p2;
## }
## 
## parameters {
##   real<lower=0> hier_mean;
##   real<lower=0> hier_cov;
## 
##   vector<lower=0>[n] mu;
## }
## 
## transformed parameters {
##   // shape <- 1 / (cv^2)
##   // scale <- mu * cv^2
## 
##   real<lower=0> s    = 1 / (hier_cov  * hier_cov);
##   real<lower=0> beta = 1 / (hier_mean * hier_cov * hier_cov);
## }
## 
## model {
##   hier_mean ~ gamma(mean_p1, mean_p2);
##   hier_cov  ~ gamma(cov_p1,  cov_p2);
## 
##   mu ~ gamma(s, beta);
## 
##   target += exponential_lccdf(min_lifetime | mu);
##   target += exponential_lcdf (max_lifetime | mu);
## }
## 
## generated quantities {
## #include lifetime_genquan.stan
## }

We see this model is very similar to the frequency model, using Gamma priors.
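
The transformed parameters block reparameterises the Gamma prior for mu in terms of its mean and coefficient of variation (cv). A small R check of the conversion (the function name is illustrative):

```r
# Convert (mean, cv) to the Gamma (shape, rate) used in the transformed
# parameters block, then confirm the implied mean and cv round-trip correctly.
gamma_from_mean_cv <- function(mean, cv) {
  s    <- 1 / cv^2
  beta <- 1 / (mean * cv^2)
  c(shape = s, rate = beta, implied_mean = s / beta, implied_cv = 1 / sqrt(s))
}

gamma_from_mean_cv(mean = 0.1, cv = 0.5)
```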

3.1 Compile and Fit First Hierarchical Model

We first need to compile this model.

ltmodel_hier_stanmodel <- cmdstan_model(
  "stan_code/ltmodel_hier.stan",
  include_paths = stan_codedir,
  pedantic      =  TRUE,
  dir           = stan_modeldir
  )
stan_modelname <- "ltmodel_hier"
stanfit_prefix <- str_c("fit_", stan_modelname) 

stan_data_lst <- training_stats_tbl %>%
  select(customer_id, obs_time = T_cal, min_lifetime, max_lifetime) %>%
  compose_data(
    mean_p1 =  5,
    mean_p2 = 50,
    
    cov_p1  = 5,
    cov_p2  = 5
    )

ltmodel_hier_stanfit <- ltmodel_hier_stanmodel$sample(
  data            =                 stan_data_lst,
  chains          =                             4,
  iter_warmup     =                           500,
  iter_sampling   =                           500,
  seed            =                          4205,
  output_dir      =                 stan_modeldir,
  output_basename =                stanfit_prefix
  )
## Running MCMC with 4 parallel chains...
## 
## Chain 1 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 2 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 3 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 4 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 1 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 3 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 2 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 4 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 3 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 1 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 2 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 3 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 4 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 2 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 1 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 3 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 2 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 4 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 1 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 4 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 2 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 2 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 3 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 3 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 1 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 1 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 4 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 4 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 1 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 2 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 3 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 4 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 1 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 2 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 3 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 4 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 1 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 4 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 3 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 2 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 1 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 4 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 3 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 1 finished in 180.6 seconds.
## Chain 2 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 3 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 4 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 3 finished in 185.9 seconds.
## Chain 4 finished in 185.8 seconds.
## Chain 2 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 2 finished in 189.4 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 185.4 seconds.
## Total execution time: 190.7 seconds.
ltmodel_hier_stanfit$summary()
## # A tibble: 50,005 × 10
##    variable       mean    median       sd     mad      q5     q95  rhat ess_bulk
##    <chr>         <dbl>     <dbl>    <dbl>   <dbl>   <dbl>   <dbl> <dbl>    <dbl>
##  1 lp__      6291.     6330.      1.39e+3 1.40e+3 3.90e+3 8.65e+3  2.10     5.30
##  2 hier_mean    0.0185    0.0184  1.77e-4 1.74e-4 1.82e-2 1.88e-2  1.26    12.7 
##  3 hier_cov     0.0680    0.0671  9.47e-3 9.31e-3 5.30e-2 8.55e-2  2.09     5.32
##  4 mu[1]        0.0185    0.0185  1.28e-3 1.22e-3 1.64e-2 2.06e-2  1.01  1757.  
##  5 mu[2]        0.0182    0.0183  1.23e-3 1.24e-3 1.63e-2 2.03e-2  1.01  1662.  
##  6 mu[3]        0.0184    0.0184  1.29e-3 1.24e-3 1.64e-2 2.07e-2  1.01  2018.  
##  7 mu[4]        0.0185    0.0185  1.31e-3 1.30e-3 1.64e-2 2.07e-2  1.01  1908.  
##  8 mu[5]        0.0184    0.0184  1.29e-3 1.25e-3 1.63e-2 2.05e-2  1.01  1382.  
##  9 mu[6]        0.0185    0.0185  1.25e-3 1.21e-3 1.64e-2 2.06e-2  1.01  1419.  
## 10 mu[7]        0.0185    0.0184  1.33e-3 1.25e-3 1.64e-2 2.08e-2  1.02  1710.  
## # … with 49,995 more rows, and 1 more variable: ess_tail <dbl>

As before, we need to check the HMC diagnostics.

ltmodel_hier_stanfit$cmdstan_diagnose()
## Processing csv files: /home/rstudio/workshop/stan_models/fit_ltmodel_hier-1.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_hier-2.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_hier-3.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_hier-4.csv
## 
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
## 
## Checking sampler transitions for divergences.
## No divergent transitions found.
## 
## Checking E-BFMI - sampler transitions HMC potential energy.
## The E-BFMI, 0.01, is below the nominal threshold of 0.30 which suggests that HMC may have trouble exploring the target distribution.
## If possible, try to reparameterize the model.
## 
## Effective sample size satisfactory.
## 
## The following parameters had split R-hat greater than 1.05:
##   hier_mean, hier_cov, s, beta
## Such high values indicate incomplete mixing and biased estimation.
## You should consider regularizating your model with additional prior information or a more effective parameterization.
## 
## Processing complete.

It looks like this model has too many degrees of freedom: our data is not strong enough to let the model discriminate between the hierarchical parameters, so instead we have identifiability issues.

Note that we are not finding any divergent transitions in this model; instead, the problem is that our posterior sample has not converged.
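This kind of non-convergence shows up directly in the split R-hat statistic. As a self-contained illustration (using simulated draws rather than our fit), we can build two chains that settle on different values, as an unidentified model tends to do, and confirm that `posterior::rhat()` flags them:

```r
library(posterior)

# Two simulated chains stuck at different values, mimicking the incomplete
# mixing reported above; split R-hat flags the disagreement.
set.seed(42)
stuck_draws <- cbind(
  chain1 = rnorm(500, mean = 0.0, sd = 0.1),
  chain2 = rnorm(500, mean = 0.5, sd = 0.1)
  )
rhat(stuck_draws)  # well above the 1.05 threshold
```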

To illustrate this, we look at the traceplots for the hierarchical parameters.

mcmc_trace(
    x    = ltmodel_hier_stanfit$draws(),
    pars = c("hier_mean", "hier_cov")
    ) +
  ggtitle("Traceplots of Lifetime Hierarchical Parameter Values")

3.2 Compile and Fit One-Parameter Hierarchical Model

We need to reduce the degrees of freedom in this hierarchical model, so we fix one of the hierarchical parameters and allow only one of the Gamma parameters to be fit by the data.
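As a sketch of the reduced hierarchy (assuming, as the fixed `s = 1` in the data list suggests, that the model holds the Gamma shape constant and fits only the hierarchical mean, with implied rate `s / hier_mean`):

```r
# With the Gamma shape s fixed, only the hierarchical mean is fitted;
# the implied rate is s / hier_mean, so draws of mu average to hier_mean.
set.seed(42)
s         <- 1
hier_mean <- 0.0275   # roughly the posterior value reported below
mu_draws  <- rgamma(100000, shape = s, rate = s / hier_mean)
mean(mu_draws)        # approximately hier_mean
```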

ltmodel_hier_one_stanmodel <- cmdstan_model(
  "stan_code/ltmodel_hier_one.stan",
  include_paths = stan_codedir,
  dir           = stan_modeldir,
  pedantic      = TRUE
  )

We now fit this model using our data and our priors.

stan_modelname <- "ltmodel_hier_one"
stanfit_prefix <- str_c("fit_", stan_modelname) 

stan_data_lst <- training_stats_tbl %>%
  select(customer_id, obs_time = T_cal, min_lifetime, max_lifetime) %>%
  compose_data(
    mean_p1 =  2,
    mean_p2 = 20,

    s       =  1
    )
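For reference, `tidybayes::compose_data()` turns the data frame into the named list that Stan expects, adding the row count `n` automatically; extra named arguments (here the prior values) are passed through unchanged. A minimal illustration with a toy table:

```r
library(tidybayes)

# Toy example: compose_data() keeps the columns, adds n = 3, and passes
# through any extra named values we supply.
demo_tbl <- data.frame(obs_time = c(10, 20, 30))
demo_lst <- compose_data(demo_tbl, mean_p1 = 2, mean_p2 = 20)
str(demo_lst)  # list with obs_time, n = 3, mean_p1, mean_p2
```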

ltmodel_hier_one_stanfit <- ltmodel_hier_one_stanmodel$sample(
  data            =                 stan_data_lst,
  chains          =                             4,
  iter_warmup     =                           500,
  iter_sampling   =                           500,
  seed            =                          4205,
  output_dir      =                 stan_modeldir,
  output_basename =                stanfit_prefix
  )
## Running MCMC with 4 parallel chains...
## 
## Chain 1 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 2 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 3 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 4 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 1 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 2 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 4 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 1 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 2 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 3 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 4 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 1 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 2 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 3 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 1 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 4 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 2 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 3 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 4 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 1 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 1 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 2 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 2 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 3 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 4 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 4 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 3 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 3 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 1 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 2 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 4 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 3 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 1 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 2 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 4 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 3 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 1 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 2 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 4 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 3 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 1 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 2 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 4 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 3 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 1 finished in 125.8 seconds.
## Chain 2 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 2 finished in 126.2 seconds.
## Chain 4 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 4 finished in 127.5 seconds.
## Chain 3 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 3 finished in 130.1 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 127.4 seconds.
## Total execution time: 131.8 seconds.
ltmodel_hier_one_stanfit$summary()
## # A tibble: 50,003 × 10
##    variable       mean   median      sd     mad       q5      q95  rhat ess_bulk
##    <chr>         <dbl>    <dbl>   <dbl>   <dbl>    <dbl>    <dbl> <dbl>    <dbl>
##  1 lp__       -2.09e+4 -2.09e+4 7.31e+1 7.44e+1 -2.10e+4 -2.08e+4 1.00      631.
##  2 hier_mean   2.76e-2  2.75e-2 4.85e-4 4.86e-4  2.68e-2  2.84e-2 1.00      931.
##  3 mu[1]       3.22e-2  2.57e-2 2.61e-2 2.03e-2  4.25e-3  8.44e-2 1.00     2706.
##  4 mu[2]       7.72e-3  6.12e-3 5.77e-3 4.70e-3  1.28e-3  1.92e-2 1.00     2438.
##  5 mu[3]       2.20e-2  1.77e-2 1.67e-2 1.24e-2  3.63e-3  5.51e-2 1.00     2521.
##  6 mu[4]       3.17e-2  2.46e-2 2.56e-2 1.89e-2  4.93e-3  8.28e-2 1.00     2453.
##  7 mu[5]       2.26e-2  1.82e-2 1.74e-2 1.35e-2  3.70e-3  5.70e-2 1.00     2454.
##  8 mu[6]       3.15e-2  2.47e-2 2.53e-2 1.92e-2  5.08e-3  8.38e-2 1.00     2261.
##  9 mu[7]       2.71e-2  2.14e-2 2.06e-2 1.60e-2  4.40e-3  6.77e-2 1.00     3033.
## 10 mu[8]       2.56e-2  2.03e-2 2.08e-2 1.59e-2  4.10e-3  6.53e-2 0.999    2748.
## # … with 49,993 more rows, and 1 more variable: ess_tail <dbl>

As before, we need to check the HMC diagnostics.

ltmodel_hier_one_stanfit$cmdstan_diagnose()
## Processing csv files: /home/rstudio/workshop/stan_models/fit_ltmodel_hier_one-1.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_hier_one-2.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_hier_one-3.csv, /home/rstudio/workshop/stan_models/fit_ltmodel_hier_one-4.csv
## 
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
## 
## Checking sampler transitions for divergences.
## No divergent transitions found.
## 
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
## 
## Effective sample size satisfactory.
## 
## Split R-hat values satisfactory all parameters.
## 
## Processing complete, no problems detected.

We should also check some of the trace plots:

mcmc_trace(
    x    = ltmodel_hier_one_stanfit$draws(),
    pars = c("hier_mean")
    ) +
  ggtitle("Traceplots of Lifetime Hierarchical Parameter Values")

3.2.1 Plot Customer Lifetime Diagnostic Plots

Once again we construct validation plots for our data, comparing the posterior distributions against the known customer values.

ltmodel_hier_one_validation_tbl <- ltmodel_hier_one_stanfit %>%
  recover_types(training_stats_tbl) %>%
  spread_draws(mu[customer_id], tau[customer_id]) %>%
  ungroup() %>%
  inner_join(customer_simparams_tbl, by = "customer_id") %>%
  select(
    customer_id, first_tnx_date, draw_id = .draw,
    post_mu = mu, post_tau_mean = tau,
    customer_mu, customer_tau
    )

ltmodel_hier_one_validation_tbl %>% glimpse()
## Rows: 20,000,000
## Columns: 7
## $ customer_id    <chr> "C201101_0002", "C201101_0002", "C201101_0002", "C20110…
## $ first_tnx_date <date> 2011-01-01, 2011-01-01, 2011-01-01, 2011-01-01, 2011-0…
## $ draw_id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ post_mu        <dbl> 0.01524500, 0.09333740, 0.08243720, 0.03675360, 0.02700…
## $ post_tau_mean  <dbl> 65.59540, 10.71380, 12.13040, 27.20820, 37.03190, 78.94…
## $ customer_mu    <dbl> 0.009023485, 0.009023485, 0.009023485, 0.009023485, 0.0…
## $ customer_tau   <dbl> 17.25275, 17.25275, 17.25275, 17.25275, 17.25275, 17.25…

We now use our existing routine to compare the posterior values for mu with the predetermined values.

ltmodel_hier_one_mu_qvalues_tbl <- ltmodel_hier_one_validation_tbl %>%
  calculate_distribution_qvals(post_mu, customer_mu, customer_id, first_tnx_date)
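The idea behind `calculate_distribution_qvals()` is worth sketching (a hypothetical stand-alone version, not the helper itself): for each customer, the q-value is the fraction of posterior draws falling below the known simulation value, and a well-calibrated posterior yields uniformly distributed q-values.

```r
# Stand-in posterior draws centred on the known value: the q-value is the
# share of draws below the truth, which sits near 0.5 for a calibrated fit.
set.seed(42)
true_mu <- 0.02
post_mu <- rnorm(2000, mean = true_mu, sd = 0.005)
q_val   <- mean(post_mu < true_mu)
q_val   # near 0.5
```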

ref_value <- ltmodel_hier_one_mu_qvalues_tbl %>%
  nrow() %>%
  divide_by(50)

ggplot(ltmodel_hier_one_mu_qvalues_tbl) +
  geom_histogram(aes(x = q_val), bins = 50) +
  geom_hline(aes(yintercept = ref_value), colour = "red") +
  labs(
    x = "Quantile",
    y = "Count",
    title = "Quantile Plot of the q-Values for the Posterior Distribution"
    )

ggplot(ltmodel_hier_one_mu_qvalues_tbl) +
  geom_point(aes(x = q_val, y = customer_mu)) +
  labs(
    x = "Quantile",
    y = "Customer Mu",
    title = "Scatterplot of q-Value against Mu"
    )

3.2.2 Calculate hier_one Model Posterior Summary Statistics

We also want to calculate some summary statistics of the posterior quantities.

ltmodel_hier_one_postsummary_tbl <- ltmodel_hier_one_validation_tbl %>%
  group_by(customer_id) %>%
  summarise(
    .groups = "drop",
    
    mu_p10       = quantile(post_mu, 0.10),
    mu_p25       = quantile(post_mu, 0.25),
    mu_median    = median  (post_mu),
    mu_mean      = mean    (post_mu),
    mu_p75       = quantile(post_mu, 0.75),
    mu_p90       = quantile(post_mu, 0.90),
    
    tau_p10      = quantile(post_tau_mean, 0.10),
    tau_p25      = quantile(post_tau_mean, 0.25),
    tau_median   = median  (post_tau_mean),
    tau_mean     = mean    (post_tau_mean),
    tau_p75      = quantile(post_tau_mean, 0.75),
    tau_p90      = quantile(post_tau_mean, 0.90),
    
    customer_mu  = unique(customer_mu),
    customer_tau = unique(customer_tau)
    )

ltmodel_hier_one_postsummary_tbl %>% glimpse()
## Rows: 10,000
## Columns: 15
## $ customer_id  <chr> "C201101_0002", "C201101_0004", "C201101_0006", "C201101_…
## $ mu_p10       <dbl> 0.006584319, 0.001911039, 0.005625193, 0.007399732, 0.005…
## $ mu_p25       <dbl> 0.013751725, 0.003495250, 0.010576075, 0.013584125, 0.010…
## $ mu_median    <dbl> 0.025724750, 0.006117480, 0.017723200, 0.024568500, 0.018…
## $ mu_mean      <dbl> 0.032192187, 0.007715464, 0.021957003, 0.031732767, 0.022…
## $ mu_p75       <dbl> 0.04271982, 0.01048665, 0.02902128, 0.04240180, 0.0301956…
## $ mu_p90       <dbl> 0.06496142, 0.01554940, 0.04342043, 0.06614218, 0.0448212…
## $ tau_p10      <dbl> 15.39372, 64.31107, 23.03065, 15.11899, 22.31086, 15.6097…
## $ tau_p25      <dbl> 23.40835, 95.35902, 34.45745, 23.58390, 33.11742, 23.9691…
## $ tau_median   <dbl> 38.87300, 163.46600, 56.42320, 40.70250, 54.97885, 40.469…
## $ tau_mean     <dbl> 72.87064, 296.78178, 90.39908, 67.36516, 95.18223, 66.654…
## $ tau_p75      <dbl> 72.71855, 286.10250, 94.55350, 73.61538, 97.34675, 70.651…
## $ tau_p90      <dbl> 151.8764, 523.2762, 177.7716, 135.1405, 178.2489, 134.418…
## $ customer_mu  <dbl> 0.009023485, 0.016847819, 0.048667792, 0.141250317, 0.058…
## $ customer_tau <dbl> 17.252751, 155.215456, 28.310863, 9.680664, 36.956593, 4.…

We can now visualise these summary statistics.

plot_tbl <- ltmodel_hier_one_postsummary_tbl %>%
  slice_sample(n = 25) %>%
  arrange(customer_id)

ggplot(plot_tbl) +
  geom_errorbar(
    aes(x = customer_id, ymin = mu_p10, ymax = mu_p90),
    width = 0, size = 1) +
  geom_errorbar(
    aes(x = customer_id, ymin = mu_p25, ymax = mu_p75),
    width = 0, size = 4) +
  geom_point(
    aes(x = customer_id, y = customer_mu),
    colour = "red", size = 2.5) +
  labs(
    x = "Customer ID",
    y = "Posterior Mu",
    title = "Summary Statistics of Mu Values"
    ) +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))

ggplot(plot_tbl) +
  geom_errorbar(
    aes(x = customer_id, ymin = tau_p10, ymax = tau_p90),
    width = 0, size = 1) +
  geom_errorbar(
    aes(x = customer_id, ymin = tau_p25, ymax = tau_p75),
    width = 0, size = 4) +
  geom_point(
    aes(x = customer_id, y = customer_tau),
    colour = "red", size = 2.5) +
  scale_y_log10(labels = label_comma()) +
  labs(
    x = "Customer ID",
    y = "Posterior Tau",
    title = "Summary Statistics of Tau Values"
    ) +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))

3.2.3 Investigate Extreme q-Value

As we have done with our “Tight Prior” model, we look at the data where the \(q\) value is very low.

plot_tbl <- ltmodel_hier_one_mu_qvalues_tbl %>%
  filter(q_val < 0.05) %>%
  select(customer_id) %>%
  inner_join(ltmodel_hier_one_postsummary_tbl, by = "customer_id")

ggplot(plot_tbl) +
  geom_errorbar(
    aes(x = customer_id, ymin = mu_p10, ymax = mu_p90),
    size = 1, width = 0) +
  geom_errorbar(
    aes(x = customer_id, ymin = mu_p25, ymax = mu_p75),
    size = 3, width = 0) +
  geom_point(
    aes(x = customer_id, y = customer_mu),
    colour = "red") +
  labs(
    x = "Customer ID",
    y = "Lifetime Mu",
    title = "Posterior Estimates of Small q-Value Customers") +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))

So far we have looked at customer_mu, but it may be more informative to look at customer_tau instead.

plot_tbl <- ltmodel_hier_one_mu_qvalues_tbl %>%
  filter(q_val < 0.05) %>%
  select(customer_id) %>%
  inner_join(ltmodel_hier_one_postsummary_tbl, by = "customer_id")

ggplot(plot_tbl) +
  geom_errorbar(
    aes(x = customer_id, ymin = tau_p10, ymax = tau_p90),
    size = 1, width = 0) +
  geom_errorbar(
    aes(x = customer_id, ymin = tau_p25, ymax = tau_p75),
    size = 3, width = 0) +
  geom_point(
    aes(x = customer_id, y = customer_tau),
    colour = "red") +
  labs(
    x = "Customer ID",
    y = "Lifetime Tau",
    title = "Posterior Estimates of Small q-Value Customers") +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.5, size = 8))

4 R Environment

options(width = 120L)
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.1.3 (2022-03-10)
##  os       Ubuntu 20.04.4 LTS
##  system   x86_64, linux-gnu
##  ui       RStudio
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Etc/UTC
##  date     2022-06-07
##  rstudio  2022.02.1+461 Prairie Trillium (server)
##  pandoc   2.17.1.1 @ /usr/lib/rstudio-server/bin/quarto/bin/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package        * version   date (UTC) lib source
##  abind            1.4-5     2016-07-21 [1] RSPM (R 4.1.0)
##  arrayhelpers     1.1-0     2020-02-04 [1] RSPM (R 4.1.0)
##  assertthat       0.2.1     2019-03-21 [1] RSPM (R 4.1.0)
##  backports        1.4.1     2021-12-13 [1] RSPM (R 4.1.0)
##  base64enc        0.1-3     2015-07-28 [1] RSPM (R 4.1.0)
##  bayesplot      * 1.9.0     2022-03-10 [1] RSPM (R 4.1.0)
##  bit              4.0.4     2020-08-04 [1] RSPM (R 4.1.0)
##  bit64            4.0.5     2020-08-30 [1] RSPM (R 4.1.0)
##  bookdown         0.26      2022-04-15 [1] RSPM (R 4.1.0)
##  boot             1.3-28    2021-05-03 [2] CRAN (R 4.1.3)
##  bridgesampling   1.1-2     2021-04-16 [1] RSPM (R 4.1.0)
##  brms           * 2.17.0    2022-06-02 [1] Github (paul-buerkner/brms@a43937c)
##  Brobdingnag      1.2-7     2022-02-03 [1] RSPM (R 4.1.0)
##  broom            0.8.0     2022-04-13 [1] RSPM (R 4.1.0)
##  bslib            0.3.1     2021-10-06 [1] RSPM (R 4.1.0)
##  cachem           1.0.6     2021-08-19 [1] RSPM (R 4.1.0)
##  callr            3.7.0     2021-04-20 [1] RSPM (R 4.1.0)
##  cellranger       1.1.0     2016-07-27 [1] RSPM (R 4.1.0)
##  checkmate        2.0.0     2020-02-06 [1] RSPM (R 4.1.0)
##  cli              3.2.0     2022-02-14 [1] RSPM (R 4.1.0)
##  cmdstanr       * 0.5.2     2022-06-02 [1] Github (stan-dev/cmdstanr@9bbc4eb)
##  coda             0.19-4    2020-09-30 [1] RSPM (R 4.1.0)
##  codetools        0.2-18    2020-11-04 [2] CRAN (R 4.1.3)
##  colorspace       2.0-3     2022-02-21 [1] RSPM (R 4.1.0)
##  colourpicker     1.1.1     2021-10-04 [1] RSPM (R 4.1.0)
##  conflicted     * 1.1.0     2021-11-26 [1] RSPM (R 4.1.0)
##  cowplot        * 1.1.1     2020-12-30 [1] RSPM (R 4.1.0)
##  crayon           1.5.1     2022-03-26 [1] RSPM (R 4.1.0)
##  crosstalk        1.2.0     2021-11-04 [1] RSPM (R 4.1.0)
##  curl             4.3.2     2021-06-23 [1] RSPM (R 4.1.0)
##  DBI              1.1.2     2021-12-20 [1] RSPM (R 4.1.0)
##  dbplyr           2.1.1     2021-04-06 [1] RSPM (R 4.1.0)
##  digest           0.6.29    2021-12-01 [1] RSPM (R 4.1.0)
##  directlabels   * 2021.1.13 2021-01-16 [1] RSPM (R 4.1.0)
##  distributional   0.3.0     2022-01-05 [1] RSPM (R 4.1.0)
##  dplyr          * 1.0.8     2022-02-08 [1] RSPM (R 4.1.0)
##  DT               0.22      2022-03-28 [1] RSPM (R 4.1.0)
##  dygraphs         1.1.1.6   2018-07-11 [1] RSPM (R 4.1.0)
##  ellipsis         0.3.2     2021-04-29 [1] RSPM (R 4.1.0)
##  evaluate         0.15      2022-02-18 [1] RSPM (R 4.1.0)
##  fansi            1.0.3     2022-03-24 [1] RSPM (R 4.1.0)
##  farver           2.1.0     2021-02-28 [1] RSPM (R 4.1.0)
##  fastmap          1.1.0     2021-01-25 [1] RSPM (R 4.1.0)
##  forcats        * 0.5.1     2021-01-27 [1] RSPM (R 4.1.0)
##  fs             * 1.5.2     2021-12-08 [1] RSPM (R 4.1.0)
##  furrr          * 0.2.3     2021-06-25 [1] RSPM (R 4.1.0)
##  future         * 1.24.0    2022-02-19 [1] RSPM (R 4.1.0)
##  gamm4            0.2-6     2020-04-03 [1] RSPM (R 4.1.0)
##  generics         0.1.2     2022-01-31 [1] RSPM (R 4.1.0)
##  ggdist           3.1.1     2022-02-27 [1] RSPM (R 4.1.0)
##  ggplot2        * 3.3.5     2021-06-25 [1] RSPM (R 4.1.0)
##  ggridges         0.5.3     2021-01-08 [1] RSPM (R 4.1.0)
##  globals          0.14.0    2020-11-22 [1] RSPM (R 4.1.0)
##  glue           * 1.6.2     2022-02-24 [1] RSPM (R 4.1.0)
##  gridExtra        2.3       2017-09-09 [1] RSPM (R 4.1.0)
##  gtable           0.3.0     2019-03-25 [1] RSPM (R 4.1.0)
##  gtools           3.9.2     2021-06-06 [1] RSPM (R 4.1.0)
##  haven            2.5.0     2022-04-15 [1] RSPM (R 4.1.0)
##  highr            0.9       2021-04-16 [1] RSPM (R 4.1.0)
##  hms              1.1.1     2021-09-26 [1] RSPM (R 4.1.0)
##  htmltools        0.5.2     2021-08-25 [1] RSPM (R 4.1.0)
##  htmlwidgets      1.5.4     2021-09-08 [1] RSPM (R 4.1.0)
##  httpuv           1.6.5     2022-01-05 [1] RSPM (R 4.1.0)
##  httr             1.4.2     2020-07-20 [1] RSPM (R 4.1.0)
##  igraph           1.3.1     2022-04-20 [1] RSPM (R 4.1.0)
##  inline           0.3.19    2021-05-31 [1] RSPM (R 4.1.0)
##  jquerylib        0.1.4     2021-04-26 [1] RSPM (R 4.1.0)
##  jsonlite         1.8.0     2022-02-22 [1] RSPM (R 4.1.0)
##  knitr            1.38      2022-03-25 [1] RSPM (R 4.1.0)
##  labeling         0.4.2     2020-10-20 [1] RSPM (R 4.1.0)
##  later            1.3.0     2021-08-18 [1] RSPM (R 4.1.0)
##  lattice          0.20-45   2021-09-22 [2] CRAN (R 4.1.3)
##  lifecycle        1.0.1     2021-09-24 [1] RSPM (R 4.1.0)
##  listenv          0.8.0     2019-12-05 [1] RSPM (R 4.1.0)
##  lme4             1.1-29    2022-04-07 [1] RSPM (R 4.1.0)
##  loo              2.5.1     2022-03-24 [1] RSPM (R 4.1.0)
##  lubridate        1.8.0     2021-10-07 [1] RSPM (R 4.1.0)
##  magrittr       * 2.0.3     2022-03-30 [1] RSPM (R 4.1.0)
##  markdown         1.1       2019-08-07 [1] RSPM (R 4.1.0)
##  MASS             7.3-55    2022-01-16 [2] CRAN (R 4.1.3)
##  Matrix           1.4-0     2021-12-08 [2] CRAN (R 4.1.3)
##  matrixStats      0.62.0    2022-04-19 [1] RSPM (R 4.1.0)
##  memoise          2.0.1     2021-11-26 [1] RSPM (R 4.1.0)
##  mgcv             1.8-39    2022-02-24 [2] CRAN (R 4.1.3)
##  mime             0.12      2021-09-28 [1] RSPM (R 4.1.0)
##  miniUI           0.1.1.1   2018-05-18 [1] RSPM (R 4.1.0)
##  minqa            1.2.4     2014-10-09 [1] RSPM (R 4.1.0)
##  modelr           0.1.8     2020-05-19 [1] RSPM (R 4.1.0)
##  munsell          0.5.0     2018-06-12 [1] RSPM (R 4.1.0)
##  mvtnorm          1.1-3     2021-10-08 [1] RSPM (R 4.1.0)
##  nlme             3.1-155   2022-01-16 [2] CRAN (R 4.1.3)
##  nloptr           2.0.0     2022-01-26 [1] RSPM (R 4.1.0)
##  parallelly       1.31.0    2022-04-07 [1] RSPM (R 4.1.0)
##  pillar           1.7.0     2022-02-01 [1] RSPM (R 4.1.0)
##  pkgbuild         1.3.1     2021-12-20 [1] RSPM (R 4.1.0)
##  pkgconfig        2.0.3     2019-09-22 [1] RSPM (R 4.1.0)
##  plyr             1.8.7     2022-03-24 [1] RSPM (R 4.1.0)
##  posterior      * 1.2.1     2022-03-07 [1] RSPM (R 4.1.0)
##  prettyunits      1.1.1     2020-01-24 [1] RSPM (R 4.1.0)
##  processx         3.5.3     2022-03-25 [1] RSPM (R 4.1.0)
##  projpred         2.1.1     2022-04-03 [1] RSPM (R 4.1.0)
##  promises         1.2.0.1   2021-02-11 [1] RSPM (R 4.1.0)
##  ps               1.6.0     2021-02-28 [1] RSPM (R 4.1.0)
##  purrr          * 0.3.4     2020-04-17 [1] RSPM (R 4.1.0)
##  quadprog         1.5-8     2019-11-20 [1] RSPM (R 4.1.0)
##  R6               2.5.1     2021-08-19 [1] RSPM (R 4.1.0)
##  Rcpp           * 1.0.8.3   2022-03-17 [1] RSPM (R 4.1.0)
##  RcppParallel     5.1.5     2022-01-05 [1] RSPM (R 4.1.0)
##  readr          * 2.1.2     2022-01-30 [1] RSPM (R 4.1.0)
##  readxl           1.4.0     2022-03-28 [1] RSPM (R 4.1.0)
##  reprex           2.0.1     2021-08-05 [1] RSPM (R 4.1.0)
##  reshape2         1.4.4     2020-04-09 [1] RSPM (R 4.1.0)
##  rlang          * 1.0.2     2022-03-04 [1] RSPM (R 4.1.0)
##  rmarkdown        2.13      2022-03-10 [1] RSPM (R 4.1.0)
##  rmdformats       1.0.3     2021-10-06 [1] RSPM (R 4.1.0)
##  rstan            2.26.11   2022-06-02 [1] local
##  rstantools       2.2.0     2022-04-08 [1] RSPM (R 4.1.0)
##  rstudioapi       0.13      2020-11-12 [1] RSPM (R 4.1.0)
##  rvest            1.0.2     2021-10-16 [1] RSPM (R 4.1.0)
##  sass             0.4.1     2022-03-23 [1] RSPM (R 4.1.0)
##  scales         * 1.2.0     2022-04-13 [1] RSPM (R 4.1.0)
##  sessioninfo      1.2.2     2021-12-06 [1] RSPM (R 4.1.0)
##  shiny            1.7.1     2021-10-02 [1] RSPM (R 4.1.0)
##  shinyjs          2.1.0     2021-12-23 [1] RSPM (R 4.1.0)
##  shinystan        2.6.0     2022-03-03 [1] RSPM (R 4.1.0)
##  shinythemes      1.2.0     2021-01-25 [1] RSPM (R 4.1.0)
##  StanHeaders      2.26.11   2022-06-02 [1] local
##  stringi          1.7.6     2021-11-29 [1] RSPM (R 4.1.0)
##  stringr        * 1.4.0     2019-02-10 [1] RSPM (R 4.1.0)
##  svUnit           1.0.6     2021-04-19 [1] RSPM (R 4.1.0)
##  tensorA          0.36.2    2020-11-19 [1] RSPM (R 4.1.0)
##  threejs          0.3.3     2020-01-21 [1] RSPM (R 4.1.0)
##  tibble         * 3.1.6     2021-11-07 [1] RSPM (R 4.1.0)
##  tidybayes      * 3.0.2     2022-01-05 [1] RSPM (R 4.1.0)
##  tidyr          * 1.2.0     2022-02-01 [1] RSPM (R 4.1.0)
##  tidyselect       1.1.2     2022-02-21 [1] RSPM (R 4.1.0)
##  tidyverse      * 1.3.1     2021-04-15 [1] RSPM (R 4.1.0)
##  tzdb             0.3.0     2022-03-28 [1] RSPM (R 4.1.0)
##  utf8             1.2.2     2021-07-24 [1] RSPM (R 4.1.0)
##  V8               4.1.0     2022-02-06 [1] RSPM (R 4.1.0)
##  vctrs            0.4.1     2022-04-13 [1] RSPM (R 4.1.0)
##  vroom            1.5.7     2021-11-30 [1] RSPM (R 4.1.0)
##  withr            2.5.0     2022-03-03 [1] RSPM (R 4.1.0)
##  xfun             0.30      2022-03-02 [1] RSPM (R 4.1.0)
##  xml2             1.3.3     2021-11-30 [1] RSPM (R 4.1.0)
##  xtable           1.8-4     2019-04-21 [1] RSPM (R 4.1.0)
##  xts              0.12.1    2020-09-09 [1] RSPM (R 4.1.0)
##  yaml             2.3.5     2022-02-21 [1] RSPM (R 4.1.0)
##  zoo              1.8-10    2022-04-15 [1] RSPM (R 4.1.0)
## 
##  [1] /usr/local/lib/R/site-library
##  [2] /usr/local/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
options(width = 80L)